Refactoring of pdf_extract.py script
#114
Open
Description:
This PR refactors the pdf_extract.py script to improve the readability and maintainability of the code.
To avoid affecting the current code, the app.py script and the app_tools library have been created. app.py performs the same process as pdf_extract.py.
The app_tools library incorporates the refactorings of the different steps.
app_tools
|- pdf.py
|- layout_analysis.py
|- formula_analysis.py
|- ocr_analysis.py
|- table_analysis.py
|- visualize.py
|- config.py
|- utils.py
If you find it useful, you can replace pdf_extract.py with app.py.
Motivation:
I love the project and would like to thank you for the great work.
The refactoring is a first step toward building an API with FastAPI and Docker.
Main changes:
- app.py has been created with the pipeline of pdf_extract.py.
- app_tools has been created; it contains the classes and methods that perform each step of the pipeline:
  - pdf.py: provides a set of tools for working with PDF files.
  - layout_analysis.py: analyzes document layout by detecting the layout of each page in a document image.
  - formula_analysis.py: handles formula detection and recognition in images.
  - ocr_analysis.py: OCR processor, responsible for performing OCR recognition.
  - table_analysis.py: table processor, used for table recognition in documents.
  - visualize.py: generates visualizations of the document layout.
  - config.py: configures model parameters and logging.
  - utils.py: saves results as JSON.
Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.
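For orientation, the one-module-per-step structure described under Main changes can be sketched roughly as below. This is only an illustration under stated assumptions: the names Step, LayoutStep, OcrStep, Pipeline, and save_results are hypothetical and are not the actual classes or functions in app_tools, and the step bodies are placeholders rather than real model calls.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Protocol


class Step(Protocol):
    """Hypothetical interface: a step takes a page record and returns it enriched."""

    def run(self, page: dict[str, Any]) -> dict[str, Any]: ...


class LayoutStep:
    """Placeholder for layout detection (a real step would call a model)."""

    def run(self, page: dict[str, Any]) -> dict[str, Any]:
        # Pretend the whole page is a single text region.
        page["layout"] = [
            {"type": "text", "bbox": [0, 0, page["width"], page["height"]]}
        ]
        return page


class OcrStep:
    """Placeholder for OCR: attaches an empty string to every text region."""

    def run(self, page: dict[str, Any]) -> dict[str, Any]:
        for region in page.get("layout", []):
            if region["type"] == "text":
                region["text"] = ""  # a real step would store recognized text
        return page


class Pipeline:
    """Runs the configured steps, in order, on every page."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps

    def process(self, pages: list[dict[str, Any]]) -> list[dict[str, Any]]:
        results = []
        for page in pages:
            for step in self.steps:
                page = step.run(page)
            results.append(page)
        return results


def save_results(results: list[dict[str, Any]], path: str | Path) -> None:
    """Write the per-page results to a UTF-8 JSON file."""
    Path(path).write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )


if __name__ == "__main__":
    pipeline = Pipeline([LayoutStep(), OcrStep()])
    output = pipeline.process([{"page_id": 0, "width": 800, "height": 1000}])
    save_results(output, "result.json")
```

Keeping each step behind the same small interface is what would let a future API layer reuse the steps individually, although how the actual app.py wires them together should be checked against the code in this PR.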
Instructions for Reviewers:
Review the app.py and app_tools scripts to ensure that the logic has been ported correctly.
Example of Use: